Short utterance
VoiceExtender: Short-utterance Text-independent Speaker Verification with Guided Diffusion Model
He, Yayun, Kang, Zuheng, Wang, Jianzong, Peng, Junqing, Xiao, Jing
Speaker verification (SV) performance deteriorates as utterances become shorter. To address this problem, we propose a new architecture, VoiceExtender, which provides a promising way to improve SV performance on short-duration speech signals. We use two guided diffusion models, one guided by a built-in and one by an external speaker embedding (SE); both employ a diffusion-model-based sample generator that leverages SE guidance to augment the speech features of a short utterance. Extensive experimental results on the VoxCeleb1 dataset show that our method outperforms the baseline, with relative improvements in equal error rate (EER) of 46.1%, 35.7%, 10.4%, and 5.7% for short-utterance conditions of 0.5, 1.0, 1.5, and 2.0 seconds, respectively.
- Research Report > New Finding (0.46)
- Research Report > Promising Solution (0.34)
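As a rough illustration of the SE-guided generation idea described in the abstract, the sketch below runs a DDPM-style sampling loop in which a toy denoiser is conditioned on a speaker embedding extracted from the short utterance. The network, the dimensions, and the linear noise schedule are assumptions for illustration, not the paper's actual VoiceExtender implementation.

```python
# Minimal sketch of speaker-embedding-guided diffusion sampling for feature
# augmentation. The toy denoiser and the linear beta schedule are assumed,
# not taken from the paper.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Predicts the noise in a feature frame, conditioned on a speaker embedding."""
    def __init__(self, feat_dim=80, se_dim=192, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + se_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, x_t, t, se):
        # t is a scalar timestep broadcast to the batch; se is the SE guidance.
        t_feat = t.expand(x_t.size(0), 1)
        return self.net(torch.cat([x_t, t_feat, se], dim=-1))

@torch.no_grad()
def se_guided_sample(denoiser, se, n_frames=200, feat_dim=80, steps=50):
    """DDPM-style ancestral sampling under a linear beta schedule (assumed)."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(n_frames, feat_dim)                  # start from pure noise
    se_rep = se.expand(n_frames, -1)                     # one embedding guides all frames
    for i in reversed(range(steps)):
        t = torch.tensor([[i / steps]])
        eps = denoiser(x, t, se_rep)                     # predicted noise under SE guidance
        x = (x - betas[i] / (1 - alpha_bars[i]).sqrt() * eps) / alphas[i].sqrt()
        if i > 0:
            x = x + betas[i].sqrt() * torch.randn_like(x)
    return x                                             # augmented feature frames

# Usage: augment features for a short utterance given its speaker embedding.
se = torch.randn(1, 192)                                 # embedding from the short utterance
augmented = se_guided_sample(ToyDenoiser(), se)
print(augmented.shape)                                   # torch.Size([200, 80])
```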
Meta-Learning for Short Utterance Speaker Recognition with Imbalance Length Pairs
Kye, Seong Min, Jung, Youngmoon, Lee, Hae Beom, Hwang, Sung Ju, Kim, Hoirin
In realistic settings, a speaker recognition system must identify a speaker from a short utterance, while the enrollment utterance may be relatively long. However, existing speaker recognition models perform poorly on such short utterances. To solve this problem, we introduce a meta-learning scheme with imbalanced length pairs: we use a prototypical network and train it with a support set of long utterances and a query set of short utterances. However, since optimizing only for the classes in a given episode is not sufficient to learn embeddings that discriminate the other classes in the dataset, we additionally classify both the support and query sets against all classes in the training set to learn a well-discriminated embedding space. By combining these two learning schemes, our model outperforms existing state-of-the-art speaker verification models trained in a standard supervised learning framework on short utterances (1-2 seconds) from the VoxCeleb dataset. We also validate the proposed model on unseen-speaker identification, where it likewise achieves significant gains over existing approaches.
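To make the training setup concrete, here is a minimal sketch of one episode with imbalanced length pairs: long support utterances, short query utterances, a prototypical loss over the episode's classes, and an additional softmax classification of both sets against all training classes. The toy encoder, the dimensions, and the equal loss weighting are assumptions, not the authors' configuration.

```python
# Minimal sketch of an imbalanced-length-pair episode: prototypical loss on
# the episode plus a global softmax over the full training-class inventory.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """Maps variable-length feature sequences to fixed-size speaker embeddings."""
    def __init__(self, feat_dim=40, emb_dim=128):
        super().__init__()
        self.proj = nn.Linear(feat_dim, emb_dim)

    def forward(self, x):                  # x: (batch, frames, feat_dim)
        return self.proj(x).mean(dim=1)    # temporal average pooling

def episode_loss(encoder, global_head, support, query, episode_labels, global_labels):
    """Prototypical loss on the episode plus classification against all classes."""
    s_emb = encoder(support)                               # long utterances
    q_emb = encoder(query)                                 # short utterances
    n_way = episode_labels.max().item() + 1
    protos = torch.stack([s_emb[episode_labels == c].mean(0) for c in range(n_way)])
    logits = -torch.cdist(q_emb, protos)                   # negative distance as score
    proto_loss = F.cross_entropy(logits, episode_labels)
    # Classify both sets against every class in the training set.
    all_emb = torch.cat([s_emb, q_emb])
    global_loss = F.cross_entropy(global_head(all_emb), global_labels.repeat(2))
    return proto_loss + global_loss

# Usage: a 5-way episode, one long support (300 frames) and one short query
# (100 frames) per class, with 1,000 classes in the full training set.
enc, head = ToyEncoder(), nn.Linear(128, 1000)
support = torch.randn(5, 300, 40)          # long enrollment-like utterances
query = torch.randn(5, 100, 40)            # short test-like utterances
episode_labels = torch.arange(5)
global_labels = torch.randint(0, 1000, (5,))
loss = episode_loss(enc, head, support, query, episode_labels, global_labels)
loss.backward()
```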
Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances
Gusev, Aleksei, Volokhov, Vladimir, Andzhukaev, Tseren, Novoselov, Sergey, Lavrentyeva, Galina, Volkova, Marina, Gazizullina, Alice, Shulipa, Andrey, Gorlanov, Artem, Avdeeva, Anastasia, Ivanov, Artem, Kozlov, Alexander, Pekhovsky, Timur, Matveev, Yuri
Speaker recognition systems based on deep speaker embeddings have achieved strong performance in controlled conditions, according to results obtained on early NIST SRE (Speaker Recognition Evaluation) datasets. From a practical point of view, given the increased interest in virtual assistants (such as Amazon Alexa, Google Home, and Apple Siri), speaker verification on short utterances in uncontrolled, noisy environments is one of the most challenging and highly demanded tasks. This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing system quality degradation on short utterances. For these purposes, we considered deep neural network architectures based on TDNN (Time-Delay Neural Network) and ResNet (Residual Neural Network) blocks, and we experimented with state-of-the-art embedding extractors and their training procedures. The obtained results confirm that ResNet architectures outperform the standard x-vector approach in terms of speaker verification quality for both long-duration and short-duration utterances. We also investigate the impact of the speech activity detector, different scoring models, and adaptation and score normalization techniques. Experimental results are presented for publicly available data and verification protocols for the VoxCeleb1, VoxCeleb2, and VOiCES datasets.
- Information Technology > Artificial Intelligence > Speech (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (0.83)
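Among the back-end techniques this abstract lists, score normalization lends itself to a compact illustration. The sketch below implements adaptive s-norm with a top-k cosine cohort; the specific variant, the cohort size, and cosine scoring are assumptions rather than the paper's exact setup.

```python
# Minimal sketch of adaptive score normalization (AS-norm) for a single
# verification trial. Cohort, top_k, and cosine scoring are assumed.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def as_norm(enroll, test, cohort, top_k=100):
    """Normalize a trial score with statistics of its top-k cohort scores."""
    raw = cosine(enroll, test)
    e_scores = np.sort([cosine(enroll, c) for c in cohort])[-top_k:]
    t_scores = np.sort([cosine(test, c) for c in cohort])[-top_k:]
    z_e = (raw - e_scores.mean()) / e_scores.std()   # normalize w.r.t. enroll cohort
    z_t = (raw - t_scores.mean()) / t_scores.std()   # normalize w.r.t. test cohort
    return 0.5 * (z_e + z_t)

# Usage with random stand-in embeddings.
rng = np.random.default_rng(0)
enroll, test = rng.standard_normal(256), rng.standard_normal(256)
cohort = rng.standard_normal((500, 256))
print(as_norm(enroll, test, cohort))
```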
Short utterance compensation in speaker verification via cosine-based teacher-student learning of speaker embeddings
Jung, Jee-weon, Heo, Hee-soo, Shim, Hye-jin, Yu, Ha-jin
The short duration of an input utterance is one of the most critical factors degrading the performance of speaker verification systems. This study aimed to develop an integrated text-independent speaker verification system for input utterances with a short duration of 2 seconds or less. We propose an approach using a teacher-student learning framework, applied to short-utterance compensation for the first time to our knowledge. The core concept of the proposed system is to conduct the compensation throughout the network that extracts the speaker embedding, mainly at the phonetic level, rather than compensating with a separate system after the speaker embedding is extracted. In the proposed architecture, phonetic-level features, each representing a 130 ms segment, are extracted using convolutional layers, and a layer of gated recurrent units extracts an utterance-level feature from them. The proposed approach also adopts a new objective function for teacher-student learning that considers both the Kullback-Leibler divergence of the output layers and the cosine distance of the speaker embedding layers. Experiments were conducted on the VoxCeleb1 dataset using deep neural networks that take raw waveforms as input and output speaker embeddings. The proposed model compensated for approximately 65% of the performance degradation due to the shortened duration.
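The proposed objective combines two terms that are easy to state in code: the Kullback-Leibler divergence between teacher and student output distributions and the cosine distance between their speaker embeddings. The sketch below shows this combined loss; the equal weighting of the two terms and all dimensions are assumptions.

```python
# Minimal sketch of the combined teacher-student objective: KL divergence on
# output posteriors plus cosine distance on speaker embeddings.
import torch
import torch.nn.functional as F

def ts_compensation_loss(student_emb, teacher_emb, student_logits, teacher_logits):
    """Teacher sees the full utterance; student sees the truncated one."""
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    cos = (1 - F.cosine_similarity(student_emb, teacher_emb, dim=-1)).mean()
    return kl + cos

# Usage: embeddings and class posteriors for a batch of 8 utterances
# (1,211 classes, the number of VoxCeleb1 dev speakers).
s_emb, t_emb = torch.randn(8, 512, requires_grad=True), torch.randn(8, 512)
s_log, t_log = torch.randn(8, 1211, requires_grad=True), torch.randn(8, 1211)
loss = ts_compensation_loss(s_emb, t_emb, s_log, t_log)
loss.backward()
```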
Deep neural network based i-vector mapping for speaker verification using short utterances
Guo, Jinxi, Xu, Ning, Qian, Kailun, Shi, Yang, Xu, Kaiyuan, Wu, Yingnian, Alwan, Abeer
Text-independent speaker recognition using short utterances is a highly challenging task due to the large variation and content mismatch between short utterances. I-vector based systems have become the standard in speaker verification applications but are less effective with short utterances. In this paper, we first compare two state-of-the-art universal background model training methods for i-vector modeling on full-length and short-utterance evaluation tasks: a Gaussian mixture model (GMM) based method and a deep neural network (DNN) based method. The results indicate that the I-vector_DNN system outperforms the I-vector_GMM system across durations; however, the performance of both systems degrades significantly as utterance duration decreases. To address this issue, we propose two novel nonlinear mapping methods that train DNN models to map i-vectors extracted from short utterances to their corresponding long-utterance i-vectors. The mapped i-vectors can restore missing information and reduce the variance of the original short-utterance i-vectors. Both proposed methods model the joint representation of short- and long-utterance i-vectors using an autoencoder. Experimental results on the NIST SRE 2010 dataset show that both methods provide significant improvement, with a maximum relative improvement of 28.43% in equal error rate over a baseline system when using a deep encoder with residual blocks and an additional phoneme vector. When the best-validated SRE10 models are further tested on the Speakers In The Wild dataset, the methods yield a 23.12% improvement on arbitrary-duration (1-5 s) short-utterance conditions.
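As a rough illustration of the mapping idea, the sketch below trains a small DNN to regress a short-utterance i-vector, concatenated with a phoneme vector, onto its paired long-utterance i-vector. The layer sizes, the plain MSE objective, and the feed-forward architecture are simplifying assumptions; the paper's models are autoencoder-based with residual blocks.

```python
# Minimal sketch of short-to-long i-vector mapping with an auxiliary phoneme
# vector. Dimensions and the MSE objective are assumed for illustration.
import torch
import torch.nn as nn

class IVectorMapper(nn.Module):
    def __init__(self, ivec_dim=400, phone_dim=40, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ivec_dim + phone_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, ivec_dim),
        )

    def forward(self, short_ivec, phone_vec):
        # The phoneme vector supplies content information missing from short cuts.
        return self.net(torch.cat([short_ivec, phone_vec], dim=-1))

# Usage: train on paired short/long i-vectors from the same recordings.
mapper = IVectorMapper()
opt = torch.optim.Adam(mapper.parameters(), lr=1e-3)
short_iv, long_iv = torch.randn(32, 400), torch.randn(32, 400)
phones = torch.randn(32, 40)
for _ in range(3):                       # a few toy steps
    opt.zero_grad()
    loss = nn.functional.mse_loss(mapper(short_iv, phones), long_iv)
    loss.backward()
    opt.step()
```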